Abstract

In this paper we investigate the performance of different types of
rectified activation functions in convolutional neural networks: the
standard rectified linear unit (ReLU), the leaky rectified linear unit
(Leaky ReLU), the parametric rectified linear unit (PReLU) and a new
randomized leaky rectified linear unit (RReLU). We evaluate these
activation functions on standard image classification tasks. Our
experiments suggest that incorporating a non-zero slope for the negative
part in rectified activation units could consistently improve the
results. Our findings thus contradict the common belief that sparsity is
the key to the good performance of ReLU. Moreover, on small-scale
datasets, both using a deterministic negative slope and learning it are
prone to overfitting, and neither is as effective as the randomized
counterpart. By using RReLU, we achieved 75.68% accuracy on the
CIFAR-100 test set without multiple-view testing or ensembling.

ANTINUCLEON@GMAIL.COM
WINSTY@GMAIL.COM
TQCHEN@CS.WASHINGTON.EDU
MULI@CS.CMU.EDU

1. Introduction

Convolutional neural networks (CNNs) have achieved great success in
various computer vision tasks, such as image classification (Krizhevsky
et al., 2012; Szegedy et al., 2014), object detection (Girshick et al.,
2014) and tracking (Wang et al., 2015). Besides depth, one of the key
characteristics of modern deep learning systems is the use of
non-saturated activation functions (e.g. ReLU) in place of their
saturated counterparts (e.g. sigmoid, tanh). The advantage of using
non-saturated activation functions lies in two aspects: the first is to
solve the so-called "exploding/vanishing gradient" problem; the second
is to accelerate convergence.

Among these non-saturated activation functions, the most notable one is
the rectified linear unit (ReLU) (Nair & Hinton, 2010; Sun et al.,
2014). Briefly speaking, it is a piecewise linear function which prunes
the negative part to zero and retains the positive part. It has the
desirable property that the activations are sparse after passing through
ReLU. It is commonly believed that the superior performance of ReLU
comes from this sparsity (Glorot et al., 2011; Sun et al., 2014). In
this paper, we ask two questions: First, is sparsity the most important
factor for good performance? Second, can we design better non-saturated
activation functions that beat ReLU?

We consider a broader class of activation functions, namely the
rectified unit family. In particular, we are interested in the leaky
ReLU and its variants. In contrast to ReLU, in which the negative part
is totally dropped, leaky ReLU assigns a non-zero slope to it. The first
variant is called the parametric rectified linear unit (PReLU) (He et
al., 2015). In PReLU, the slopes of the negative part are learned from
data rather than predefined. The authors claimed that PReLU is the key
factor in surpassing human-level performance on the ImageNet
classification (Russakovsky et al., 2015) task. The second variant is
called the randomized rectified linear unit (RReLU). In RReLU, the
slopes of the negative parts are randomized within a given range during
training, and then fixed during testing. In the recent Kaggle National
Data Science Bowl (NDSB) competition¹, it was reported that RReLU could
reduce overfitting due to its randomized nature.

In this paper, we empirically evaluate these four kinds of activation
functions. Based on our experiments, we conclude that on small datasets,
Leaky ReLU and its variants are consistently better than ReLU in
convolutional neural networks. RReLU is favorable because its randomness
in training reduces the risk of overfitting. For large datasets, more
investigation is needed in the future.

2. Rectified Units 

In this section, we introduce the four kinds of rectified units:
rectified linear (ReLU), leaky rectified linear (Leaky ReLU), parametric
rectified linear (PReLU) and randomized rectified linear (RReLU). We
illustrate them in Fig. 1 for comparison. In the sequel, we use $x_{ji}$
to denote the input of the $i$th channel in the $j$th example, and
$y_{ji}$ to denote the corresponding output after passing the activation
function. In the following subsections, we introduce each rectified unit
formally.




[Figure 1 panel titles recovered: Leaky ReLU/PReLU; Randomized Leaky ReLU]

Figure 1: ReLU, Leaky ReLU, PReLU and RReLU. For PReLU, $a_i$ is learned,
and for Leaky ReLU $a_i$ is fixed. For RReLU, $a_{ji}$ is a random
variable that is sampled from a given range during training and remains
fixed in testing.


2.1. Rectified Linear Unit

Rectified Linear was first used in Restricted Boltzmann Machines (Nair &
Hinton, 2010). Formally, the rectified linear activation is defined as:

$$
y_i =
\begin{cases}
x_i & \text{if } x_i \ge 0 \\
0   & \text{if } x_i < 0.
\end{cases}
\tag{1}
$$

2.2. Leaky Rectified Linear Unit

The Leaky Rectified Linear activation was first introduced in acoustic
models (Maas et al., 2013). Mathematically, we have

$$
y_i =
\begin{cases}
x_i             & \text{if } x_i \ge 0 \\
\frac{x_i}{a_i} & \text{if } x_i < 0,
\end{cases}
\tag{2}
$$

where $a_i$ is a fixed parameter in the range $(1, +\infty)$. In the
original paper, the authors suggest setting $a_i$ to a large number like
100. In addition to this setting, we also experiment with a smaller
$a_i = 5.5$ in our paper.

2.3. Parametric Rectified Linear Unit

Parametric rectified linear was proposed by He et al. (2015). The authors
reported that its performance is much better than that of ReLU in
large-scale image classification tasks. It is the same as leaky ReLU
(Eqn. 2) with the exception that $a_i$ is learned during training via
back-propagation.

2.4. Randomized Leaky Rectified Linear Unit

Randomized Leaky Rectified Linear is the randomized version of leaky
ReLU. It was first proposed and used in the Kaggle NDSB Competition. The
highlight of RReLU is that in the training process, $a_{ji}$ is a random
number sampled from a uniform distribution $U(l, u)$. Formally, we have:

$$
y_{ji} =
\begin{cases}
x_{ji}        & \text{if } x_{ji} \ge 0 \\
a_{ji} x_{ji} & \text{if } x_{ji} < 0,
\end{cases}
\tag{3}
$$

where

$$
a_{ji} \sim U(l, u), \quad l < u \ \text{and} \ l, u \in [0, 1).
\tag{4}
$$

Note that Eqs. (3) and (4) write the negative slope as a multiplier in
$[0, 1)$, while the sampling range $U(3, 8)$ and the test-time rule below
follow the divisor convention of Eq. (2), under which the negative slope
is $1/a_{ji}$.

In the test phase, we take the average of all the $a_{ji}$ used in
training, as in the method of dropout (Srivastava et al., 2014), and thus
set $a_{ji}$ to $\frac{l+u}{2}$ to get a deterministic result. As
suggested by the NDSB competition winner, $a_{ji}$ is sampled from
$U(3, 8)$. We use the same configuration in this paper.

At test time, we use:

$$
y_{ji} = \frac{x_{ji}}{\frac{l+u}{2}}
$$
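The rectified units defined in this section can be sketched in a few lines of plain Python. This is a minimal scalar illustration rather than the CXXNET implementation: the function names and the `training` flag are assumptions made for the example, and the negative part uses the divisor convention of Eq. (2) and of the test-time rule, with the U(3, 8) range quoted in the text as the default.

```python
import random


def relu(x):
    """Eq. (1): keep non-negative inputs, set negative inputs to zero."""
    return x if x >= 0 else 0.0


def leaky_relu(x, a=5.5):
    """Eq. (2): divide negative inputs by a fixed a > 1 (slope 1/a)."""
    return x if x >= 0 else x / a


# PReLU shares the forward pass of leaky_relu; the difference is that `a`
# is a trainable parameter updated by back-propagation (not shown here).


def rrelu(x, lower=3.0, upper=8.0, training=True):
    """RReLU under the divisor convention (an assumption of this sketch).

    Training: the divisor is drawn from U(lower, upper) on every call.
    Testing:  the divisor is fixed to the mean (lower + upper) / 2.
    """
    if x >= 0:
        return x
    if training:
        a = random.uniform(lower, upper)
    else:
        a = (lower + upper) / 2.0
    return x / a
```

With the U(3, 8) setting, the test-time divisor is (3 + 8) / 2 = 5.5, which is the same value as the smaller leaky ReLU setting used in the experiments.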


1 Kaggle National Data Science Bowl Competition: 
https://www.kaggle.com/c/datasciencebowl 


3. Experiment Settings 

We evaluate classification performance on the same convolutional network
structure with different activation functions. Because the
hyper-parameter search space is large, we use two state-of-the-art
convolutional network structures and the same hyper-parameters for every
activation setting. All models are trained using CXXNET².

2 CXXNET: https://github.com/dmlc/cxxnet










Empirical Evaluation of Rectified Activations in Convolutional Network 


3.1. CIFAR-10 and CIFAR-100 

The CIFAR-10 and CIFAR-100 datasets (Krizhevsky & Hinton, 2009) are tiny
natural image datasets. The CIFAR-10 dataset contains images from 10
different classes and the CIFAR-100 dataset contains images from 100
different classes. Each image is an RGB image of size 32x32. There are
50,000 training images and 10,000 test images. We use raw images directly
without any pre-processing or augmentation. The results are from a
single-view test without any ensemble.

The network structure is shown in Table 1. It is taken from Network in
Network (NIN) (Lin et al., 2013).


Input Size    NIN
----------    ----------------------
32 x 32       5x5, 192
32 x 32       1x1, 160
32 x 32       1x1, 96
32 x 32       3x3 max pooling, /2
16 x 16       dropout, 0.5
16 x 16       5x5, 192
16 x 16       1x1, 192
16 x 16       1x1, 192
16 x 16       3x3 avg pooling, /2
8 x 8         dropout, 0.5
8 x 8         3x3, 192
8 x 8         1x1, 192
8 x 8         1x1, 10
8 x 8         8x8 avg pooling, /1
10 or 100     softmax

Table 1. CIFAR-10/CIFAR-100 network structure. Each layer is a
convolutional layer if not otherwise specified. An activation function
follows each convolutional layer.

In the CIFAR-100 experiment, we also tested RReLU on the Batch Norm
Inception Network (Ioffe & Szegedy, 2015). We use a subset of the
Inception Network which starts from the inception-3a module. This network
achieved 75.68% test accuracy without any ensemble or multiple-view
test³.

3.2. National Data Science Bowl Competition 

The task of the National Data Science Bowl competition is to classify
plankton animals from images, with an award of $170k. There are 30,336
labeled grayscale images in 121 classes, and 130,400 test images. Since
the test set is private, we divide the training set into two parts:
25,000 images for training and 5,336 images for validation. The
competition uses multi-class log-loss to evaluate classification
performance.
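For reference, the multi-class log-loss is the mean negative log-probability that a model assigns to the true class of each example. The sketch below is a plain-Python illustration; the function name and the clipping constant `eps` are assumptions made for the example and are not taken from the competition rules or from CXXNET.

```python
import math


def multiclass_log_loss(probs, labels, eps=1e-15):
    """Mean negative log-probability of the true class.

    probs  -- list of per-example probability lists (one entry per class)
    labels -- list of true class indices, one per example
    eps    -- clipping bound that keeps log() finite (assumed value)
    """
    total = 0.0
    for row, label in zip(probs, labels):
        p = min(max(row[label], eps), 1.0 - eps)
        total -= math.log(p)
    return total / len(labels)
```

A model that predicts a uniform distribution over the 121 classes would score log(121), which gives a rough reference point when reading the log-loss values reported for this dataset.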

3 CIFAR-100 reproduction code: https://github.com/dmlc/mxnet/blob/master/example/notebooks/cifar-100.ipynb


We take the network and augmentation settings from team AuroraXie⁴, one
of the competition winners. The network structure is shown in Table 2. We
only use a single-view test in our experiment, which differs from the
original multi-view, multi-scale test.


Input Size    NDSB Net
----------    ------------------------------
70 x 70       3x3, 32
70 x 70       3x3, 32
70 x 70       3x3, max pooling, /2
35 x 35       3x3, 64
35 x 35       3x3, 64
35 x 35       3x3, 64
35 x 35       3x3, max pooling, /2
17 x 17       split: branch 1 | branch 2
17 x 17       3x3, 96 | 3x3, 96
17 x 17       3x3, 96 | 3x3, 96
17 x 17       3x3, 96 | 3x3, 96
17 x 17       3x3, 96
17 x 17       channel concat, 192
17 x 17       3x3, max pooling, /2
8 x 8         3x3, 256
8 x 8         3x3, 256
8 x 8         3x3, 256
8 x 8         3x3, 256
8 x 8         3x3, 256
8 x 8         SPP (He et al., 2014) {1, 2, 4}
12544 x 1     flatten
1024 x 1      fc1
1024 x 1      fc2
121           softmax

Table 2. National Data Science Bowl Competition Network. All layers are
convolutional layers if not otherwise specified. An activation function
follows each convolutional layer.
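The spatial sizes listed in Tables 1 and 2 can be reproduced with a short calculation. The sketch below rests on two assumptions that the text does not state: convolutions are padded so that they preserve the spatial size, and pooling uses no padding while keeping the last partial window (ceiling rounding). The helper names are hypothetical.

```python
import math


def pool_out(size, kernel=3, stride=2):
    """Output size of un-padded pooling that keeps the last partial window."""
    return math.ceil((size - kernel) / stride) + 1


def trace(size, n_pools):
    """Spatial sizes seen after each successive 3x3, stride-2 pooling layer."""
    sizes = [size]
    for _ in range(n_pools):
        sizes.append(pool_out(sizes[-1]))
    return sizes


cifar_sizes = trace(32, 2)  # NIN on CIFAR: two /2 pooling layers
ndsb_sizes = trace(70, 3)   # NDSB Net: three /2 pooling layers
global_pool = pool_out(8, kernel=8, stride=1)  # final 8x8 average pooling

print(cifar_sizes, ndsb_sizes, global_pool)
```

Under these assumptions the traces are 32, 16, 8 for the CIFAR network and 70, 35, 17, 8 for the NDSB network, matching the "Input Size" columns, and the final 8x8 average pooling reduces each channel to a single value.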

4. Result and Discussion 

Tables 3 and 4 show the results on the CIFAR-10 and CIFAR-100 datasets,
respectively. Table 5 shows the NDSB results. We use the ReLU network as
the baseline, and compare its convergence curves with those of the other
three activations pairwise in Figs. 2, 3 and 4, respectively. All three
leaky ReLU variants are better than the baseline on the test set. We have
the following observations based on our experiments:

1. Not surprisingly, we find that the performance of normal leaky ReLU
(a = 100) is similar to that of ReLU, but very leaky ReLU with a
smaller a = 5.5 (i.e. a larger negative slope 1/a) is much better.

4 Winning Doc of AuroraXie: https://github.com/auroraxie/Kaggle-NDSB
2. On the training set, the error of PReLU is always the lowest, and the
errors of Leaky ReLU and RReLU are higher than that of ReLU. This
indicates that PReLU may suffer from a severe overfitting issue on
small-scale datasets.

3. On NDSB, the superiority of RReLU is more significant than on
CIFAR-10/CIFAR-100. We conjecture that this is because in the NDSB
dataset the training set is smaller than that of CIFAR-10/CIFAR-100,
while the network we use is even bigger. This validates the
effectiveness of RReLU in combating overfitting.

4. For RReLU, we still need to investigate how the randomness influences
the network training and testing process.

Activation                                   Training Error    Test Error
-----------------------------------------    --------------    ----------
ReLU                                         0.00318           0.1245
Leaky ReLU, a = 100                          0.0031            0.1266
Leaky ReLU, a = 5.5                          0.00362           0.1120
PReLU                                        0.00178           0.1179
RReLU ($y_{ji} = x_{ji} / \frac{l+u}{2}$)    0.00550           0.1119

Table 3. Error rate of CIFAR-10 Network in Network with different
activation functions.

Activation                                   Training Error    Test Error
-----------------------------------------    --------------    ----------
ReLU                                         0.1356            0.429
Leaky ReLU, a = 100                          0.11552           0.4205
Leaky ReLU, a = 5.5                          0.08536           0.4042
PReLU                                        0.0633            0.4163
RReLU ($y_{ji} = x_{ji} / \frac{l+u}{2}$)    0.1141            0.4025

Table 4. Error rate of CIFAR-100 Network in Network with different
activation functions.

Activation                                   Train Log-Loss    Val Log-Loss
-----------------------------------------    --------------    ------------
ReLU                                         0.8092            0.7727
Leaky ReLU, a = 100                          0.7846            0.7601
Leaky ReLU, a = 5.5                          0.7831            0.7391
PReLU                                        0.7187            0.7454
RReLU ($y_{ji} = x_{ji} / \frac{l+u}{2}$)    0.8090            0.7292

Table 5. Multi-class log-loss of NDSB Network with different activation
functions.

5. Conclusion

[...] reasons of their superior performances still lack rigorous
justification from a theoretical aspect. Also, how the activations
perform on large-scale data still needs to be investigated. This is an
open question worth pursuing in the future.
Figure 2: Convergence curves for training and test sets of different activations on CIFAR-10 Network in Network.






Figure 3: Convergence curves for training and test sets of different activations on CIFAR-100 Network in Network. 






Figure 4: Convergence curves for training and test sets of different activations on NDSB Net. 